Elena Tuzhilina
Oct 7, 2021
http://web.stanford.edu/~elenatuz/courses/stats32-aut2021/
dplyr
select(): pick variables/columns by their names
mutate(): create new variables/columns based on existing ones
arrange(): reorder rows
filter(): pick rows by their values
summarize(): collapse many rows down to a single summary
group_by(): perform operations at a group level
ALL of these functions take a dataset as input.
The dataset is either passed in as the first argument or piped in with %>%.
ALL of these functions return a dataset!
You can do three things with this returned dataset: print it, assign it to a variable, or pass it to another function with %>%.
%>% syntax with dplyr
Take the mtcars dataset, select just the wt and mpg columns, then select rows with mpg < 15.
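The mtcars task can be sketched with pipes as follows. This is a minimal example assuming the dplyr package is installed; the variable name low_mpg is only illustrative.

```r
library(dplyr)

# Take mtcars, keep only the wt and mpg columns, then keep rows with mpg < 15
low_mpg <- mtcars %>%
    select(wt, mpg) %>%
    filter(mpg < 15)

# The same steps written as nested function calls (read inside out)
low_mpg_nested <- filter(select(mtcars, wt, mpg), mpg < 15)

low_mpg
```

Both versions give the same dataset; the piped version reads in the order the steps happen.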
tidyr package
Another useful package for data manipulation.
Let’s consider a dataset: no. of cases for each country
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
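To follow along, the table above can be typed in by hand. A small sketch, assuming the tibble package is installed and that the dataset is called df, as in the code later in these slides:

```r
library(tibble)

# Column names that look like numbers must be wrapped in backticks
df <- tibble(
    country = c("Afghanistan", "Brazil", "China"),
    `1999`  = c(745, 37737, 212258),
    `2000`  = c(2666, 80488, 213766)
)

df
```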
gather() function
How to make a line plot of no. of cases by year for each country?
We probably want something like a line plot with x = year, y = cases and col = country.
Problem: the column names are values of the variable year, so there is no year column to plot.
Solution: Reshape dataset
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Solution: Reshape dataset using gather() function in tidyr
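library(tidyverse)  # loads dplyr, tidyr and ggplot2
# move the 1999 and 2000 columns into a year column and a cases column
df <- df %>% gather(`1999`, `2000`, key = "year", value = "cases")
ggplot() +
    geom_line(data = df, mapping = aes(x = as.numeric(year), y = cases, col = country))
A function is a named block of code which takes some input, performs a task, and returns an output.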
We’ve already seen a number of functions in R! For example,
## [1] TRUE
The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.
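A quick illustration of this behaviour. The inputs here are made up for the example and are not necessarily the ones used on the slide:

```r
is.character("hello")  # TRUE: the input is a character string
is.character(42)       # FALSE: the input is a number
```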
Others we’ve seen: str(), head(), sd(), ggplot(), select(), …
The most important syntax in R is the function call. All R syntax has function calls underlying it.
A function call consists of the function name followed by its arguments in parentheses, separated by commas.
## [1] NA
## [1] -1
Function calls read “inside out”!
abs(x): computes absolute value of x.
mean(x): computes the average value for x.
## [1] 2.6
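One way to see "inside out" evaluation is to run the inner call on its own first. The vector below is an assumed example, chosen so that the result matches the 2.6 shown above:

```r
x <- c(-1, 2, -3, 3, -4)  # made-up values

abs(x)        # the inner call runs first: 1 2 3 3 4
mean(abs(x))  # the outer call uses that result: 2.6
```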
%>% vs direct function call
%>% is implemented by the magrittr package
When the dplyr package is loaded, the %>% operator from magrittr is available too
%>% is “syntactic sugar”: makes code easier to understand
The value on the left of %>% becomes the first argument in the function on the right of %>%
## [1] 2.6
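A small side-by-side comparison of the two styles. The vector is a made-up example, and magrittr is loaded directly here so the sketch does not depend on dplyr:

```r
library(magrittr)  # provides %>%

x <- c(-1, 2, -3, 3, -4)  # made-up values

direct <- mean(abs(x))             # direct call: read inside out
piped  <- x %>% abs() %>% mean()   # piped call: read left to right

identical(direct, piped)  # TRUE

# The left-hand side becomes the FIRST argument of the function on the right,
# so these two lines do the same thing:
round(3.14159, 2)
3.14159 %>% round(2)
```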
Question: How do we find out what a function does? What inputs does it accept, what does it output, etc…
First answer: Google it! Google “R <function name>”
A (probably) better answer: Documentation in R itself!
We can see what a function does by typing in ? followed by the function name in the R console.
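For example, for sample() the help page can be opened in either of the first two ways below. The args() call is an extra suggestion, not from the slides, for when you only want the argument list:

```r
?sample          # opens the help page for sample()
help("sample")   # same thing, written as an ordinary function call
args(sample)     # prints only the arguments and their default values
```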
sample(): Description
sample(): Usage
What comes after the = sign: default value for that argument
sample(): Arguments
sample(): Details
sample(): Value
## [1] 7 9 8 3 5 6 10 1 2 4
## [1] 5 3 10 6 9 4 2 7 8 1
## [1] 1 4 1 1 6 9 10 9 8 4
## [1] 4 2 7 1 10 9 6 8 3 5
## [1] 9 10 3 9 10 3 7 3 5 4
## [1] 9 6 3 3 9
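Outputs like the ones above come from calls of roughly this shape. The exact calls and the seed are assumptions, and because sampling is random your numbers will differ:

```r
set.seed(32)  # arbitrary seed, only to make the results reproducible

perm      <- sample(10)                            # random permutation of 1:10
with_repl <- sample(10, replace = TRUE)            # 10 draws, repeats allowed
five      <- sample(10, size = 5, replace = TRUE)  # only 5 draws, repeats allowed

perm
with_repl
five
```

Leaving out size and replace uses their defaults: size is the length of the input and replace is FALSE.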
Each function in R has a name, arguments (inputs), a body, and an output (return value).
Example: for given \(x\) and \(y\) compute \(x^2+y^2\).
You can drop return.
The last line is the output.
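A sketch of both versions of this function. The names sum_of_squares and sum_of_squares2 are made up for the example:

```r
# Version with an explicit return()
sum_of_squares <- function(x, y){
    return(x^2 + y^2)
}

# Same function without return(): the last evaluated expression is the output
sum_of_squares2 <- function(x, y){
    x^2 + y^2
}
```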
Set all the arguments to some values.
For example,
## [1] 2
If you know the order of parameters in the function, you can drop the parameter names.
## [1] 200
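Calls that would give the two outputs above might look like this. The function name and the argument values are assumptions, chosen to match 2 and 200:

```r
sum_of_squares <- function(x, y){
    x^2 + y^2
}

sum_of_squares(x = 1, y = 1)   # named arguments: 2
sum_of_squares(y = 1, x = 1)   # with names, the order does not matter: 2
sum_of_squares(10, 10)         # names dropped, matched by position: 200
```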
Write a function that computes \(x^y\) given \(x\) and \(y\).
Let’s try.
## [1] 1
## [1] 1000
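One possible solution to the exercise. The name power and the test values are assumptions, picked so the results match the outputs above:

```r
power <- function(x, y){
    x^y
}

power(5, 0)    # 1
power(10, 3)   # 1000
```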
For example, a vector.
Let’s test it.
## [1] 14
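The function from the slide is not shown here, so the following is only a guess at something similar: a function that takes a whole vector and returns the sum of its squared entries, which gives 14 for the vector 1, 2, 3.

```r
# v is a numeric vector of any length
sum_sq_vec <- function(v){
    sum(v^2)
}

sum_sq_vec(c(1, 2, 3))  # 1 + 4 + 9 = 14
```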
For example, a list.
Let’s test it.
## $students
## [1] "Mary" "Bob" "Elena"
##
## $scores
## [1] 9 9 1
## [1] 6.333333
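A sketch consistent with the output above, assuming the function takes the list and averages its scores element. The names class_data and average_score are made up:

```r
class_data <- list(
    students = c("Mary", "Bob", "Elena"),
    scores   = c(9, 9, 1)
)

# data is a list with a numeric element called scores
average_score <- function(data){
    mean(data$scores)
}

class_data
average_score(class_data)  # (9 + 9 + 1) / 3 = 6.333333
```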
For example, a plot.
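plot_cases <- function(df){
    # reshape to long format: one row per country and year
    df <- df %>% gather(`1999`, `2000`, key = "year", value = "cases")
    # year is stored as text after gather(), so convert it for the x axis
    p <- ggplot() +
        geom_line(data = df, mapping = aes(x = as.numeric(year), y = cases, col = country))
    return(p)
}
Let’s check!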
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To return several values from a function, use list() inside the return().
Example: for each triple x, y, z return maximum, minimum and average.
max_min_avg <- function(x, y, z){
    # combine the inputs with c(): mean(x, y, z) would average only x
    return(list(max = max(x, y, z), min = min(x, y, z), mean = mean(c(x, y, z))))
}
## $max
## [1] 3
##
## $min
## [1] 1
##
## $mean
## [1] 2
You can store the result in a variable.
## [1] 3
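A sketch of storing and using the returned list, assuming the function is called with 1, 2, 3. Note that mean() averages only its first argument, so the values are combined with c() before averaging:

```r
max_min_avg <- function(x, y, z){
    vals <- c(x, y, z)  # put the three inputs into one vector
    list(max = max(vals), min = min(vals), mean = mean(vals))
}

res <- max_min_avg(1, 2, 3)  # store the whole list in a variable
res$max    # 3
res$min    # 1
res$mean   # 2

# Why c() is needed:
mean(1, 2, 3)     # 1: only the first argument is averaged
mean(c(1, 2, 3))  # 2: the average of all three values
```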
Example: for each triple x, y, z return maximum, minimum and average.
max_min_avg <- function(x, y = 0, z = 0){
    # combine the inputs with c(): mean(x, y, z) would average only x
    return(list(max = max(x, y, z), min = min(x, y, z), mean = mean(c(x, y, z))))
}
If you skip an argument it is set to the default value.
## $max
## [1] 1
##
## $min
## [1] 0
##
## $mean
## [1] 0.3333333
## $max
## [1] 2
##
## $min
## [1] 0
##
## $mean
## [1] 1
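Calls that skip arguments might look like the following. The argument values are assumptions, and the values are combined with c() because mean() averages only its first argument:

```r
max_min_avg <- function(x, y = 0, z = 0){
    vals <- c(x, y, z)  # put the three inputs into one vector
    list(max = max(vals), min = min(vals), mean = mean(vals))
}

r1 <- max_min_avg(1)         # y and z both take the default 0
r2 <- max_min_avg(1, 2)      # only z takes the default 0
r3 <- max_min_avg(1, z = 2)  # skip y by naming z

r1$mean  # (1 + 0 + 0) / 3
r2$mean  # (1 + 2 + 0) / 3 = 1
```

An argument without a default (here x) must always be supplied.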
None to D4: drought levels of increasing severity
Optional material
tidyr functions: gather and spread
gather: Used when some column names are not variables, but values of a variable
spread: Opposite of gather
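A round trip with the two functions, as a minimal sketch. It assumes tidyr, dplyr and tibble are installed; the names wide, long and back are illustrative:

```r
library(tidyr)
library(dplyr)   # for %>%
library(tibble)

wide <- tibble(
    country = c("Afghanistan", "Brazil", "China"),
    `1999`  = c(745, 37737, 212258),
    `2000`  = c(2666, 80488, 213766)
)

# gather: the column names 1999 and 2000 become values of a new year column
long <- wide %>% gather(`1999`, `2000`, key = "year", value = "cases")

# spread: the values of year become column names again
back <- long %>% spread(key = year, value = cases)

long
back
```

Newer tidyr versions also offer pivot_longer() and pivot_wider() for the same two jobs.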
tidyr functions: separate and unite
separate: Used to separate values in one column into multiple columns
unite: Opposite of separate
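A small sketch of both functions on a made-up dataset in which one column stores two values as "cases/population". All names and numbers here are illustrative:

```r
library(tidyr)
library(dplyr)   # for %>%
library(tibble)

rates <- tibble(
    country = c("A", "B"),
    rate    = c("10/200", "30/400")   # made-up "cases/population" strings
)

# separate: split the rate column into two columns at the "/"
split_df <- rates %>% separate(rate, into = c("cases", "population"), sep = "/")

# unite: paste the two columns back into one, with "/" in between
joined <- split_df %>% unite(rate, cases, population, sep = "/")

split_df
joined
```

After separate() the new columns are still text; use as.numeric() (or convert = TRUE) before doing arithmetic with them.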